Improvement in TF-IDF scheme for Web Pages and its Retrieval Accuracy

Authors

  • Kazunari SUGIYAMA
  • Kenji HATANO
  • Masatoshi YOSHIKAWA
  • Shunsuke UEMURA
Abstract

In information retrieval (IR) systems based on the vector space model, the tf-idf scheme is widely used to characterize documents. However, for documents with hyperlink structures, such as Web pages, a technique is needed that represents the contents of a Web page more accurately by exploiting the contents of its hyperlinked neighboring pages. In this paper, we first propose several methods for improving the tf-idf scheme for a target Web page by using the contents of its hyperlinked neighboring pages, and then compare the retrieval accuracy of the proposed methods.
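As a rough illustration of the general idea in the abstract, the following Python sketch computes plain tf-idf vectors for a tiny corpus and then adjusts the vector of a target page using the vectors of its linked neighbors. The blending rule, the parameter `alpha`, the function names, the sample pages, and the `links` graph are all assumptions made for illustration; they are not the methods proposed in the paper, which the abstract does not spell out.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document, using raw tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def blend_with_neighbors(target, neighbors, alpha=0.5):
    """Add alpha times the mean neighbor vector to the target vector.

    This additive rule is an assumed, hypothetical scheme, not the paper's.
    """
    blended = dict(target)
    if not neighbors:
        return blended
    for vec in neighbors:
        for term, weight in vec.items():
            blended[term] = blended.get(term, 0.0) + alpha * weight / len(neighbors)
    return blended

pages = [
    "tf idf weighting for web pages",    # target page
    "hyperlink structure of web pages",  # neighbor linked with the target
    "cooking recipes for pasta",         # unrelated page
]
links = {0: [1]}  # hypothetical link graph: page 0 is linked with page 1

vectors = tf_idf_vectors(pages)
refined = blend_with_neighbors(vectors[0], [vectors[i] for i in links[0]])
```

In this sketch, `refined` gains terms such as "hyperlink" that occur only in the neighbor, while terms from the unlinked page are never added. How strongly neighbors should count, and which neighbors to include, are design choices that a retrieval experiment would have to settle.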

Similar resources

On Some Methods for Improving Feature Vectors for Web Pages and their Retrieval Accuracy

In IR (information retrieval) systems based on the vector space model, the tf-idf scheme is widely used to characterize documents. However, in the case of documents with hyperlink structures such as Web pages, it is necessary to develop a technique for representing the contents of Web pages more accurately by exploiting that of their hyperlinked neighboring pages. In this paper, we first propos...

A Method of Improving Feature Vector for Web Pages Reflecting the Contents of Their Out-Linked Pages

TF-IDF schemes are popular for generating the feature vectors of documents. These schemes were proposed for characterizing a single document. Therefore, in order to characterize Web pages using tf-idf schemes, the feature vectors of the Web pages should reflect the contents of the pages linked to them via hyperlinks. In this paper, we propose three methods of generating feature vecto...

REINA at WebCLEF2006. Mixing Fields to Improve Retrieval

This paper describes the participation of the REINA Research Group of the University of Salamanca at WebCLEF 2006. The task in which we participated this year is the Monolingual Mixed Task in Spanish. To select web pages of the EuroGov collection in Spanish, the wide collection was processed with a language guesser, searching for pages in Spanish. All pages in the .es domain were also pre-s...

Toward improvement of SDR accuracy using LDA and query expansion for SpokenDoc

This paper investigates several techniques for spoken document retrieval, toward improvement of retrieval performance based on the conventional method, i.e., TF-IDF. The first approach employs rescaled unigrams of LDA to compute a similarity score. The second technique employs query expansion by web retrieval using the Yahoo! API. And the third technique is Prioritized And-operator Retrieval based on ...

Utilizing the Subjective Intent of Authoring Formats to Perform Focused Web Crawling

A successful web information retrieval system requires the ability to determine quickly and accurately whether a document or a link should be further explored. Current state-of-the-art web search engines typically use the meta-information in the HTML header to determine the relevancy of the documents. However, many documents on the web do not have such HTML header information. On the other hand...

Publication date: 2003